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Abstract 

o ' 

. Solving linear regression problems based on the total least-squares (TLS) criterion has well-documented 

merits in various applications, where perturbations appear both in the data vector as well as in the 
' regression matrix. However, existing TLS approaches do not account for sparsity possibly present in the 

unknown vector of regression coefficients. On the other hand, sparsity is the key attribute exploited by 
modern compressive sampling and variable selection approaches to linear regression, which include noise 
in the data, but do not account for perturbations in the regression matrix. The present paper fills this 
5/5 ' gap by formulating and solving TLS optimization problems under sparsity constraints. Near-optimum 

and reduced-complexity suboptimum sparse (S-) TLS algorithms are developed to address the perturbed 
compressive sampling (and the related dictionary learning) challenge, when there is a mismatch between 
■ the true and adopted bases over which the unknown vector is sparse. The novel S-TLS schemes also 

operator (Lasso), and endow TLS approaches with ability to cope with sparse, under-determined "errors- 



OO 

O 

o 



X 



in-variables" models. Interesting generalizations can further exploit prior knowledge on the perturbations 
to obtain novel weighted and structured S-TLS solvers. Analysis and simulations demonstrate the practical 
impact of S-TLS in calibrating the mismatch effects of contemporary grid-based approaches to cognitive 
radio sensing, and robust direction-of-arrival estimation using antenna arrays. 
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I. Introduction 

Sparsity is an attribute possessed by many signal vectors either naturally, or, after projecting them 
over appropriate bases. It has been exploited for a while in numerical linear algebra, statistics, and signal 
processing, but renewed interest emerged in recent years because sparsity plays an instrumental role in 
modern compressive sampling (CS) theory and applications; see e.g., 0. 

In the noise-free setup, CS holds promise to address problems as fundamental as solving exactly 
under-determined systems of linear equations when the unknown vector is sparse (U. Variants of CS 
for the "noisy setup" are rooted in the basis pursuit (BP) approach ifTTTl . which deals with fitting sparse 
linear representations to perturbed measurements - a task of major importance for signal compression 
and feature extraction. The Lagrangian form of BP is also popular in statistics for fitting sparse linear 
regression models, using the so-termed least-absolute shrinkage and selection operator (Lasso); see e.g., 
11281 . |[T9l , and references thereof. However, existing CS, BP, and Lasso-based approaches do not account 
for perturbations present in the matrix of equations, which in the BP (respectively Lasso) parlance is 
referred to as the representation basis or dictionary (correspondingly regression) matrix. 

Such perturbations appear when there is a mismatch between the adopted basis matrix and the actual 
but unknown one - a performance-critical issue in e.g., sparsity-exploiting approaches to localization, 
time delay, and Doppler estimation in communications, radar, and sonar applications |[T6l . |[22l . ll6l . El . 
Performance analysis of CS and BP approaches for the partially-perturbed linear model with perturbations 
only in the basis matrix, as well as for the fully-perturbed one with perturbations present also in the 
measurements, was pursued recently in EU1 . lfT2l . and ifTOl . But devising a systematic approach to 
reconstructing sparse vectors under either type of perturbed models was left open. 

Interestingly, for non-sparse over-determined linear systems, such an approach is available within the 
framework of total least-squares (TLS), the basic generalization of LS tailored for fitting fully-perturbed 
linear models l30l . TLS and its variants involving regularization with the ^-norm of the unknown 
vector (261, have found widespread applications in diverse areas, including system identification with 
errors-in-variables (EIV), retrieval of spatial and temporal harmonics, reconstruction of medical images, 
and forecasting of financial data j23l . TLS was also utilized by |fT3l for dictionary learning, but the 
problem reduces to an over-determined linear system with a non-sparse unknown vector. Unfortunately, 
TLS approaches, with or without existing regularization terms, cannot yield consistent estimators when 
the linear model is under-determined, nor they account for sparsity present in the unknown vector of 
regression coefficients. 
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From a high-level vantage point, the present paper is about fitting sparse, perturbed, linear models, 
through what is termed here the sparse (S-) TLS framework. On the one hand, S-TLS provides CS, 
BP, and Lasso-type algorithms suitable for fitting partially- and fully-perturbed linear models. On the 
other hand, it furnishes TLS with sparsity-cognizant regularized alternatives, which yield consistent 
estimators even for under-determined models. The novel framework does not require a priori information 
on the underlying perturbations, and in this sense S-TLS based algorithms have universal applicability. 
However, the framework is flexible enough to accommodate both deterministic as well as probabilistic 
prior information that maybe available on the perturbed data. The practical impact is apparent to any CS, 
BP/Lasso, and TLS-related application involving reconstruction of a sparse vector based on data adhering 
to an over- or under-determined, partially- or fully-perturbed, linear model. 

The specific contributions and organization of the paper are as follows. With unifying notation, Section 
im outlines the pertinent CS-BP-Lasso-TLS context, and introduces the S-TLS formulation and problem 
statement. Section HIT] presents two equivalent formulations, which are first used to establish optimality 
of S-TLS estimators in the maximum a posteriori (MAP) sense, under a fully-perturbed EIV model. 
Subsequently, the same formulations are utilized in Section JV] to develop near-optimum and reduced- 
complexity suboptimum S-TLS solvers with convergence guarantees. The scope of S-TLS is considerably 
broadened in Section |VJ where a priori information on the deterministic structure of the data vector, the 
basis matrix, and/or the statistics of the perturbations is incorporated to develop weighted and structured 
(WS) S-TLS criteria along with associated algorithms to optimize them. The impact of WSS-TLS 
is demonstrated in Section [VI] using two paradigms: cognitive radio sensing, and direction of arrival 
estimation with (possibly uncalibrated) antenna arrays. Simulated tests in Section IVTll illustrate the merits 
of the novel (WS)S-TLS framework relative to BP, Lasso, and TLS alternatives. The paper is wrapped 
up with brief concluding comments and future research directions in Section I VIII I 

Notation: Upper (lower) bold face letters are used throughout to denote matrices (column vectors); 
(•) r denotes transposition; (•)' the matrix pseudo-inverse; vec(-) the column-wise mattix vectorization; 
(g) the Kronecker product; [•] the ceiling function; l m xn the m x n matrix of all ones; Omxn the in x tl 
matrix of all zeros; I the identity matrix; || • \\p the Frobenius norm; and || • \\ p the p-th vector norm for 
p > 1; J\f(fJ., £!) the vector Gaussian distribution with mean /x and covariance S; and p[x = x'\y = y'] 
the conditional probability density function (pdf) of the continuous random variable (r.v.) x taking the 
value x', given that the r.v. y took the value y'. 
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II. Preliminaries and Problem Statement 

Consider the under-determined linear system of equations, y = C0 o , where the unknown n x 1 vector 
O is to be recovered from the given m x 1 data vector y and the m x n matrix C. With m < n and 
no further assumption, only approximations of o are possible using the minimum-norm solution; or, 
the least-squares (LS) regularized by the ^2-norm, which solves in closed form the quadratic problem: 
min^ ||y — C^||| + T || ^ 1 1 § for some chosen 7 > 0. Suppose instead that over a known basis matrix B, 
the unknown vector satisfies o = Bx Q with x Q being sparse, meaning that: 
(asO) The n x 1 vector x D contains more than n — m zero elements at unknown entries. 

Under (asO) and certain conditions on the matrix A := CB, compressive sampling (CS) theory asserts 
that exact recovery of x can be guaranteed by solving the nonconvex, combinatorially complex problem: 
min x ||x||o subject to (s.to) y = Ax. More interestingly, the same assertion holds with quantifiable 
chances if one relaxes the £q- via the ^i-norm, and solves efficiently the convex problem: min x ||x||i s.to 
y = Ax 0, 00, CD- 

Suppose now that due to data perturbations the available vector y adheres only approximately to 
the linear model Ax D . The ^i-norm based formulation accounting for the said perturbations is known 
as basis pursuit (BP) ifTTTl . and the corresponding convex problem written in its Lagrangian form is: 
min x ||y — Ax||| + Ai||x||i, where Ai > is a sparsity-tuning parameter. (For large Ai, the solution is 
driven toward the all-zero vector; whereas for small Ai it tends to the LS solution.) This form of BP 
coincides with the Lasso approach developed for variable selection in linear regression problems |[T9l , 
ll28l . For uniformity with related problems, the BP/Lasso solvers can be equivalently written as 

{XL assoi & Lasso 

}:=argmin ||e||| + Ai ||x||i (la) 

x,e 

s. to y + e = Ax . (lb) 

Two interesting questions arise at this point: i) How is the performance of CS and BP/Lasso based 
reconstruction affected if perturbations appear also in A? and ii) How can sparse vectors be efficiently 
reconstructed from over- and especially under-determined linear regression models while accounting for 
perturbations present in y and/or A? 

In the context of CS, perturbations in A can be due to disturbances in the compressing matrix C, in the 
basis matrix B, or in both. Those in C can be due to non-idealities in the analog implementation of CS; 
while those in B can also emerge because of mismatch between the adopted basis B and the actual one, 
which being unknown, is modeled as B + E#. This mismatch emerges with grid-based approaches to 
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localization, time delay, and spatio-temporal frequency or Doppler estimation GJ, (H, O, J9), lfl6l . ifTTl . 
In these applications, the entries of O have e.g., a sparse discrete-time Fourier transform with peaks 
off the frequency grid {27rfc/re}^~Q, but the postulated B is the fast Fourier transform (FFT) matrix 
built from this canonical grid. In this case, the actual linear relationship is o = (B + E#)x with x G 
sparse. Bounds on the CS reconstruction error under basis mismatch are provided in lll2l : see also ifTOl . 
where the mismatch-induced error was reduced by increasing the grid density. Performance of BP/Lasso 
approaches for the under-determined, fully-perturbed (in both y and A) linear model was analyzed in GUI 
by bounding the reconstruction error, and comparing it against its counterpart derived for the partially- 
perturbed (only in y) model derived in JH. Collectively, lfl2l and GUI address the performance question 
i), but provide no algorithms to address the open research issue ii). 

The overarching theme of the present paper is to address this issue by developing a sparse total least- 
squares (S-TLS) framework. Without exploiting sparsity, TLS has well-documented impact in applications 
as broad as linear prediction, system identification with errors-in-variables (EIV), spectral analysis, image 
reconstruction, speech, and audio processing, to name a few; see 11301 and references therein. For over- 
determined models with unknown vectors x Q not abiding with (asO), TLS estimates are given by 

{x TL5 ,E TLS ,e TLS } := arg min ||[E e]||| (2a) 

l J x,E,e 

s. to y + e = (A + E)x. (2b) 

To cope with ill-conditioned matrices A, an extra constraint bounding 1 1 Tx 1 1 2 is typically added in GJ 
to obtain different regularized TLS estimates depending on the choice of matrix T 0, G6*1 . 

The distinct objective of S-TLS relative to (regularized) TLS is twofold: account for sparsity as per 
(asO), and develop S-TLS solvers especially for under-determined, fully-perturbed linear models. To 
accomplish these goals, one must solve the S-TLS problem concretely formulated (for A > 0) as [cf. (Q]), 

©] 

xs-tls, E S _ TL 5, e S -TLS \ ■= arg mm ||[E e]|| F + A||x||i (3a) 

J x,e,E 



s. to y + e = (A + E)x 



(3b) 



The main goal is to develop efficient algorithms attaining at least the local and hopefully the global 
optimum of ([3]) - a challenging task since presence of the product Ex reveals that the problem is generally 
nonconvex. Similar to LS, BP, Lasso, and TLS, it is also worth stressing that the S-TLS estimates sought 
in © are universal in the sense that perturbations in y and A can be random or deterministic with or 
without a priori known structure. 
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But if prior knowledge is available on the perturbations, can weighted and structured S-TLS problems 
be formulated and solved? Can the scope of S-TLS be generalized (e.g., to recover a sparse matrix X D 
using A and a data matrix Y), and thus have impact in classical applications such as calibration of 
antenna arrays, or contemporary ones, such as cognitive radio sensing? Can S-TLS estimates be (e.g., 
Bayes) optimal if additional modeling assumptions are invoked? These questions will be addressed in 
the ensuing sections, starting from the last one. 

III. MAP Optimality of S-TLS for EIV Models 
Consider the EIV model with perturbed input (A) and perturbed output (y) obeying the relationship 

y = A x + (-e y ) , A = A + (-E A ) (4) 

where the notation of the model perturbations e y and E A stresses their difference with e and E, which 
are variables selected to yield the optimal S-TLS fit in ©. In a system identification setting, e y and E A 
are stationary random processes giving rise to noisy output/input data y/A, based on which the task is to 
estimate the system vector x G (comprising e.g., impulse response or pole-zero parameters), and possibly 
the inaccessible input matrix A G . To assess statistical optimality of the resultant estimators, collect the 
model perturbations in a column-vector form as vec([E,4 e^]), and further assume that: 
(asl) Perturbations of the EIV model in ((U) are independent identically distributed (Ltd.), Gaussian r.v.s, 
i.e., vec({E^ e^]) ~ A/"(0, 1), independent from A G and x G . Entries ofx a are zero-mean, Ltd., according 
to a common Laplace distribution. In addition, either (a) the entries of x Q have common Laplacian 
parameter 2/ A, and are independent from A a , which has Ltd. entries drawn from a zero-mean uniform 
(i.e., non-informative) prior pdf; or, (b) the common Laplacian parameter oj *x entries is 2(<t 2 + 1)/(Acj 2 ), 
and A c conditioned on x has i.i.d. rows with pdf W(0, a 2 [I — (1 + ||x || 2 ,)~ 1 x Xq ]). 

Note that the heavy-tailed Laplacian prior on x G under (asl) is in par with the "non-probabilistic" 
sparsity attribute in (asO). It has been used to establish that the Lasso estimator in £[]) is optimal, in the 
maximum a posteriori (MAP) sense, when E A = ||28l . If on the other hand, x D is viewed as non-sparse, 
deterministic and A c as deterministic or as adhering to (as lb), it is known that the TLS estimator in © 
is optimum in the maximum likelihood (ML) sense for the EIV model in (01); see |23l and |[24l . 

Aiming to establish optimality of S-TLS under (asl), it is useful to re-cast Q as described in the 

following lemma. (This lemma will be used also in developing S-TLS solvers in Section ITVl) 
Lemma 1: The constrained S-TLS formulation in (O is equivalent to two unconstrained (also nonconvex) 
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optimization problems: (a) one involving x and E variables, namely 



xg-TLS, E s „ rLS I = argmin [||y - (A + E)x||| + ||E|fc + A||x||i] j (5) 
and (b) one of fractional form involving only the variable x, expressed as 

(6) 



|y - Ax lli 



*S-tls ■= argmin 2 + A x i 

v x I + NI2 ; 

Proof: To establish the equivalence of (O with ©, simply eliminate e by substituting the constraint 
(l3bl into the cost function of (l3ab - For (O, let v := vec([E e]), and re-write the cost in (l3ab as || [E e] |||, = 
I [ v 1 1 § ; and the constraint (l3bl) as y — Ax = G(x)v, where G(x) := I<g)[x T , —1]. With x fixed, the ^i-norm 
can be dropped from (l3al) . and the reformulated optimization becomes: min v 1 1 v 1 1 § s. to y— Ax = G(x)v. 
But the latter is a minimum-norm LS problem, admitting the closed-form solution 

v(x) = G T (x)[G(x)G T (x)]- 1 (y - Ax) = (1 + ||x||l)- 1 G T (x)(y - Ax) (7) 

where the second equality holds because G(x)G T (x) = || [x T , — 1] H^I = (1 + ||x||2)I. Substituting (0 

back into the cost ||v|||, yields readily the fractional form in ©, which depends solely on x. ■ 

Using Lemma [T] it is possible to establish MAP optimality of the S-TLS estimator as follows. 
Proposition 1: (MAP optimality). Under (asl), the S-TLS estimator in ® is MAP optimal for the EIV 

model in (01). Specifically, (f5]) is MAP optimal for estimating both x Q and A Q under (as la), while © is 

MAP optimal for estimating only x Q under (as lb). 

Proof: Given y and A, the MAP approach to estimating both x„ and A G in (@]) amounts to 

maximizing with respect to (wrt) x and E the logarithm of the posterior pdf denoted as lnp[x G = 

x, A Q = A + E|y, A]. Recalling that x Q and A D are independent under (asla), Bayes' rule implies that 

this is equivalent to: min Xi E — {lnp[y, Ajx D = x, A D = A + E] + mp[x = x] + lnp[A = A + E]}, 

where the summands correspond to the (conditional) log-likelihood and the log-prior pdfs, respectively. 

The log-prior associated with the Laplacian pdf of x is given by 

n n 

lnp[x D = x] =lnJ][(A/4)exp(-A|x I/ |/2)] = -(A/2) W + nln(A/4) (8) 

v=\ v=l 

while the log-prior associated with the uniform pdf of A G is constant under (asla), and thus does not 
affect the MAP criterion. Conditioning the log-likelihood on x G and A Q , implies that the only sources 
of randomness in the data [y A] are the EIV model perturbations, which under (asl) are independent, 
standardized Gaussian; thus, the conditional log-likelihood is lnp[y, A|x = x, A D = A + E] = \n[e y = 
y — ( A + E)x] + lnp[E^ = E]. After omitting terms not dependent on the variables x and E, the latter 
shows that the log-likelihood contributes to the MAP criterion two quadratic terms (sum of two Gaussian 
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exponents): (l/2){||y — (A + E)x||| + ||E||}|,}. Upon combining these quadratic terms with the ^i-norm 
coming from the sum in ©, the log-posterior pdf boils down to the form minimized in (f5]), which per 
Lemma Q] is equivalent to ©, and thus establishes MAP optimality of S-TLS under (asla). 

Proceeding to prove optimality under (as lb), given again the data y and A, consider the MAP approach 
now to estimate only x D in (0]), treating A G as a nuisance parameter matrix that satisfies (as lb). MAP 
here amounts to maximizing (wrt x only) the criterion lnp[x D = x|y, A]; and Bayes' rule leads to the 
equivalent problem min x — {lnp[y, A|x Q = x] + lnp[x D = x]}. But conditioned on x Q , (aslb) dictates 
that A D and [E^ e^] are zero-mean Gaussian and independent. Thus, linearity of the EIV model (0]) 
implies that y and A are zero-mean jointly Gaussian in the conditional log-likelihood. Since rows of A Q 
and [E^ e y ] are (conditionally) i.i.d. under (aslb), the rows of matrix [A y] are independent. In addition, 
the pth-row of [A y] denoted as [aj y p ] , has inverse (conditional) covariance matrix 



E 



Up 



[ a T Vp\ 



(a 2 + l)I-a 2 xx'7(l + ||x||l) 



^ 1)1 - <r 2 xx T / 
a 2 xV(l + ||x|||) 



<x 2 x/(l + ||x|| 2 ) 
l + a 2 ||x|| 2 /(l + ||x|| 2 ) 



a 2 + l 



1 + 



a 



l+l|x|l 2 



X 


[x T -l]j 


-1 





(9) 



with determinant 1/(<t 2 + l) n not a function of x. After omitting such terms not dependent on x, and 
using the independence among rows and their inverse covariance in ©, the conditional log-likelihood 
boils down to the fractional form II y ~~ -^ x lli/ (l + ll x lll)- Since the Laplacian parameter under 

(aslb) equals 2(cr 2 + 1)/(Act 2 ), the lo g-prior in (HJ changes accordingly; and together with the fractional 
form of the log-likelihood reduces the negative log-posterior to the cost in ©. This establishes MAP 
optimality of the equivalent S-TLS in ((3]) for estimating only x Q in (0]), under (aslb). ■ 
Proposition Q] will be generalized in Section [V] to account for structured and correlated perturbations 
with known covariance matrix. But before pursuing these generalizations, S-TLS solvers of the problem 
in ([3]) are in order. 



IV. S-TLS Solvers 

Two iterative algorithms are developed in this section to solve the S-TLS problem in ©, which was 
equivalently re-formulated as in ((5]) and ©. The first algorithm can approach the global optimum but 
is computationally demanding; while the second one guarantees convergence to a local optimum but is 
computationally efficient. Thus, in addition to being attractive on its own, the second algorithm can serve 
as initialization to speed up convergence (and thus reduce computational burden) of the first one. To 
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appreciate the challenge and the associated performance-complexity tradeoffs in developing algorithms 
for optimizing S-TLS criteria, it is useful to recall that all S-TLS problems are nonconvex; hence, unlike 
ordinary TLS that can be globally optimized (e.g., via SVD 11231 ), no efficient convex optimization solver 
is available with guaranteed convergence to the global optimum of ©, ©, or ©. 

A. Bisection-based e-Optimal Algorithm 

Viewing the cost in © as a Lagrangian function, allows casting this unconstrained minimization 
problem as a constrained one. Indeed, sufficiency of the Lagrange multiplier theory implies that Q Sec 
3.3.4]: using the solution ~ks-TLS of © for a given multiplier A > and letting \i := ||xs_t.ls||i, the 
pertinent constraint is Xi(fx) := {x £ W 1 : ||x||i < /i}; and the equivalent constrained minimization 
problem is [cf. ©] 



|y- Ax |' 2 

'x6aT(aO J ' J ' 1 + ||x|| 2 



xs-tls := arg min /(x) , /(x) := llJ ~^!! 2 . (10) 



|2 

There is no need to solve © in order to specify \i, because a cross-validation scheme can be implemented 
to specify \i in the stand-alone problem (TTOl . along the lines used by e.g., E51 to determine A in ©. The 
remainder of this subsection will thus develop an iterative scheme converging to the global optimum of 
(fTOl . bearing in mind that this equivalently solves ©, © and © too. 

From a high-level view, the novel scheme comprises an outer iteration loop based on the bisection 
method lfl4l . and an inner iteration loop that relies on a variant of the branch-and-bound (BB) method 
0]]. A related approach was pursued in (5j to solve the clairvoyant TLS problem © under £2 -norm regu- 
larization constraints. The challenging difference with the S-TLS here is precisely the non-differentiable 
^i-norm constraint in X\(fi). The outer iteration "squeezes" the minimum cost /(x) in (fTOl between 
successively shrinking lower and upper bounds expressible through a parameter a. Per outer iteration, 
these bounds are obtained via inner iterations equivalently minimizing a surrogate quadratic function 
g(x, a), which does not have fractional form, and is thus more convenient to optimize than /(x). 

Given an upper bound a on /(x), the link between /(x) and g(x, a) follows if ones notes that 



|y - Ax 



2 



< a* := min f(x) = min — --7^- <a (11) 

xeA'if/j) xe^i(/i) 1 + ||x||2 



is equivalent to 



g*{a) := min g(x, a) = min { lly — Ax||| — a(l + ||x|||)) < 0. (12) 

Suppose that after outer iteration i the optimum a* in (TTTb belongs to a known interval Zj := 
Suppose further that the inner loop yields the global optimum in (fT2l for a = (k + Uj)/2, and consider 
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evaluating the sign of g*(a) at this middle point a = (k + Ui)/2 of the interval Zj. If <?*((/» + Mj)/2) > 0, 
the equivalence between (fl2l) and (fTTT) implies that a* > (k + Uj)/2 > Zj; and hence, a* £ Zi + i := 
[(/j + Mj)/2,Mj], which yields a reduced-size interval X i+1 by shrinking Zj from the left. On the other 
hand, if g*((k + Ui)/2) < 0, the said equivalence will imply that a* € li+i ■= [k, (k + Ui)/2\, which 
shrinks the Zj interval from the right. This successive shrinkage through bisection explains how the outer 
iteration converges to the global optimum of (flOl) . 

What is left before asserting rigorously this convergence, is to develop the inner iteration which ensures 
that the global optimum in (fT2l can be approached for any given a specified by the outer bisection-based 
iteration. To appreciate the difficulty here note that the Hessian of g(x, a) is given by H := 2(A T A-aI). 
Clearly, H is not guaranteed to be positive or negative definite since a is positive. As a result, the cost 
g(x, a) in (fT2l bypasses the fractional form of /(x) but it is still an indefinite quadratic, and hence 
nonconvex. Nonetheless, the quadratic form of g(x, a) allows adapting the BB iteration of HI, which can 
yield a feasible and <5-optimum solution x* satisfying: a) x* G Xi(fi); and b) g*(a) < <?(x* a) < g*(a)+S, 
where 5 denotes a pre-specified margin. 

In the present context, the BB algorithm finds successive upper and lower bounds of the function 

ffbox(a) : = „, min ^ 0(x,a) (13) 

XgA^i (jU),Xi <X<X,7 

where the constraint < x < x;/ represents a box that shrinks as iterations progress. Upon converting 
the constraints of ( fT3l to linear ones, upper bounds U on the function g^ m (a) in ( fT3l can be readily 
obtained via suboptimum solvers of the constrained optimization of the indefinite quadratic cost </(x, a); 
see e.g., Q Chp. 2]. Lower bounds on g^ ox {a) can be obtained by minimizing a convex function <7l(x, a), 
which under-approximates g(x, a) over the interval x^ < x < X[/. This convex approximant is given by 

g L (x, a) = c/(x, a) + (x - x L ) T D(x - x v ) (14) 

where D is a diagonal positive semi-definite matrix chosen to ensure that <?i,(x, a) is convex, and stays 
as close as possible below g(x, a). Such a matrix D can be found by minimizing the maximum distance 
between #l(x, a) and g(x,a), and comes out as the solution of the following minimization problem: 

min (x[/ - x L ) T D(x(/ - x L ) s. to H + 2D ± (15) 

where the constraint on the Hessian ensures that <?i,(x, a) remains convex. Since ([TBI ) is a semi-definite 
program, it can be solved efficiently using available convex optimization software; e.g., the interior point 
optimization routine in SeDuMi ||27ll . Having selected D as in ( fT5l ), min x6/Yi ^ Xi < x < Xi7 5l(x, a) is a 
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convex problem (quadratic cost under linear constraints); thus, similar to the upper bound U, the lower 
bound C on g^ OK {a) can be obtained efficiently. 

The detailed inner loop (BB scheme) is tabulated as Algorithm 1 1-aj It amounts to successively splitting 
the initial box — fxl < x < //l, which is the smallest one containing PC\{n). Per inner iteration i, variable 
U keeps track of the upper bound on <j£ ox (a), which at the end outputs to the outer loop the nearest 
estimate of g*{a). Concurrently, the lower bound L on g^ ox (a) determines whether the current box needs 
to be further split, or discarded, if the difference U — £ is smaller than the pre-selected margin 8. This 
iterative splitting leads to a decreasing U and a tighter C, both of which prevent further splitting. 

Recapitulating, the outer bisection-based iteration tabulated as Algorithm ll-bl calls Algorithm ll-al to 

find a feasible 5-optimal solution x* to evaluate the sign of g*(a) in (fl2"l ). Since x* is not the exact 

global minimum of (fl2l ). positivity of <?(x*, a) does not necessarily imply g*(a) > 0. But x* is 5- 

optimal, meaning that g*{a) > g(x*,a) — 5; thus, g(pt*a) > 5, in which case the lower bound U + \ is 

updated to (Zj + m) /2; otherwise, if g(x*, a) £ (0, S), then Zj+i should be set to (Zj + u{) /2 — 5. 

As far as convergence is concerned, the following result can be established. 
Proposition 2: (e-optimal convergence) After at most 

outputs an e-optimal solution x* to (110k that is, 



K^)/M2) 



iterations, Algorithm \l-b 



x* G Xi(ti), and a* < /(x*) < a* + e. (16) 
Proof: Upon updating the lower and upper bounds, it holds per outer iteration i > 1 that ui — Zj < 
— Zj_i) + 5; and by induction, Uj — k < (^)'«o + 25, when Zo = 0. The latter implies that if the 



number of iterations i > In ( ) / lnf2) , the distance Uj — Zj < e is satisfied. 

Since per outer iteration Algorithm 1 1 -al outputs x* £ Afi(/z), it holds that the updated x* is also feasible. 
Further, the bisection process guarantees that h < a* < /(x*) < ui per iteration i. Since Algorithm ll-bl 
ends with U{ — k < e, the inequality in (fT6l) follows readily. ■ 
Proposition |2] quantifies the number of outer iterations needed by the bisection-based Algorithm ll-bl to 
approach within e the global optimum of dTOb - In addition, the inner (BB) iterations bounding g£ ox (a) are 
expected to be fast converging because the box function in (fT3T ) is tailored for the box constraints induced 
by the ^i-norm regularization. Nonetheless, similar to all BB algorithms, the complexity of Algorithm 
ll-al does not have guaranteed polynomial complexity on average. The latter necessitates as few calls of 
Algorithm ll-al which means as few outer iterations. Proposition [2] reveals that critical to this end is the 
initial upper bound no (Algorithm |l-b| simply initializes with no = /(0)). 

This motivates the efficient suboptimal S-TLS solver of the next subsection, which is of paramount 
importance not only on its own, but also for initializing the e-optimal algorithm. 
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B. Alternating Descent Sub-Optimal Algorithm 

The starting point for a computationally efficient S-TLS solver is the formulation in (f5]). Given E, the 
cost in §5$ has the form of the Lasso problem in (QJ; while given x, it reduces to a quadratic form, which 
admits closed-form solution wrt E. These observations suggest an iterative block coordinate descent 
algorithm yielding successive estimates of x with E fixed, and alternately of E with x fixed. Specifically, 
with the iterate E(i) given per iteration i > 0, the iterate x(i) is obtained by solving the Lasso-like 
convex problem as [cf. ([TJ] 

x(i) = arg min ||y — [A + E(z)]x||2 + A||x||i . (17) 
With x(i) available, E(i + 1) for the ensuing iteration is found as 

E(i + 1) = argmin||y - Ax(i) - Ex(i)||o + ||E||^. (18) 

E 

By setting the first-order derivative of the cost wrt E equal to zero, the optimal solution to the quadratic 
problem (fT8l) is obtained in closed form as 

E(t + 1) = (1 + Mi)\\ 2 2 r l [y - Ax(i)]x T (z). (19) 

The iterations are initialized at i = by setting E(0) = mxn . Substituting the latter into ( fT71 ), yields 
x(0) = x.L asso in ([TJ. That this is a good initial estimate is corroborated by the result in |[20l . which 
shows that even with perturbations present in both A and y, the CS (and thus Lasso) estimators yield 
accurate reconstruction. In view of the fact that the block coordinate descent iterations ensure that the 
cost in ([51) is non-increasing, the final estimates upon convergence will be at least as accurate. 

The block coordinate descent algorithm is provably convergent to a stationary point of the S-TLS cost 

in (f5]), and thus to its equivalent forms in ([3]), © and ( fTOl ), as asserted in the following proposition. 
Proposition 3: ( Convergence of alternating descent) Given arbitrary initialization, the iterates {E(i), x(i)} 

given by (1171 ) and d 19b converge monotonically at least to a stationary point of the S-TLS problem (f3]). 
Proof: The argument relies on the basic convergence result in |29|. The alternating descent algorithm 

specified by (fT7T ) and ( fT9l is a special case of the block coordinate descent method using the cyclic rule 

for minimizing the cost in ©. The first two summands of this cost are differentiable wrt the optimization 

variables, while the non-differential third term (^i-norm regularization) is separable in the entries of x. 

Hence, the three summands satisfy the assumptions (B1)-(B3) and (C2) in ||29l . Convergence of the 

iterates {E(i),x(z)} to a coordinate minimum point of the cost thus follows by appealing to ||29l Thm. 

5.1]. Moreover, the first summand is Gateaux-differentiable over its domain which is open. Hence, the 

cost in (|5]) is regular at each coordinate's minimum point, and every coordinate's minimum point becomes 
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a stationary point; see 11291 Lemma 3.1]. Monotonicity of the convergence follows simply because the 
cost per iteration may either reduce or maintain its value. ■ 

Proposition [3] solidifies the merits of the alternating descent S-TLS solver. Simulated tests will further 
demonstrate that the local optimum guaranteed by this computationally efficient scheme is very close to 
the global optimum attained by the more complex scheme of the previous subsection. 

Since estimating E is simple using the closed form in (TT8t . it is useful at this point to explore 
modifications, extensions and tailored solvers for the problem in (fTTT ) by adapting to the present setup 
existing results from the Lasso literature dealing with problem CD). From the plethora of available options 
to solve ([T7t . it is worth mentioning two computationally efficient ones: the least-angle regression (LARS), 
and the coordinate descent (CD); see e.g., |fT9l . LARS provides the entire "solution path" of (fTTl ) for all 
A > at complexity comparable to LS. On the other hand, if a single "best" value of A is fixed using 
the cross-validation scheme 11251 . then CD is the state-of-the-art choice for solving (fTTl) . 

CD in the present context cycles between iterates E(i), and scalar iterates of the x(i) entries. Suppose 
that the z^-th entry x v (i) is to be found. Precursor entries {xi(i), . . . , x u -i(i)} have been already obtained 
in the z-th iteration, and postcursor entries {x u +i(i— 1), . . . , x n (i— 1)} are also available from the previous 
(i — l)-st iteration along with E(i) obtained in closed form as in (fT9l . If ot v {i) denotes the z^-th column 
of [A + E(i)], the effect of these known entries can be removed from y by forming 

v — 1 n 

&v(i) ■= y - ^2<*j(i)xj(i) ~ ^2 oij(i)xj(i - 1) . (20) 

0=1 j=v+l 

Using ( TSUI , the vector optimization problem in (fTTT ) reduces to the following scalar one with x u (i) as 
unknown: x v (i) = argmin Xi , [||ej,(i) — ay^a^Hl + A|x^|]. This scalar Lasso problem is known to admit 
a closed-form solution expressed in terms of a soft thresholding operator (see e.g., fl9l ) 



x v (i) = sign (el(i)a u (i)) 



el(i)aL v {i) A 



\a v (i)g 2j|a„(^' 2 



l,...,n (21) 



12- 

where sign(-) denotes the sign operator, and [x]+ ■= X, if X > 0, and zero otherwise. 

Cycling through the closed forms dT9T>-(F2TT) explains why CD here is faster than, and thus preferable 
over general-purpose convex optimization solvers of ( fTTl ). Another factor contributing to its speed is 
the sparsity of x(i), which implies that starting up with the all-zero vector, namely x(— 1) = nx i> 
offers initialization close to a stationary point of the cost in ([5]). Convergence to this stationary point is 
guaranteed by using the results in 11291 , along the lines of Proposition [3] Note also that larger values 
of A in (1211 ) force more entries of x(i) to be shrunk to zero, which corroborates the role of A as a 
sparsity-tuning parameter. The CD based S-TLS solver is tabulated as Algorithm |2] 
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Remark 1: (Regularization options for S-TLS) Lasso estimators are known to be biased, but modifications 
are available to remedy bias effects. One such modification is the weighted Lasso, which replaces the 
^-norm in ([3]) by its weighted version, namely Y^Z=i w v\ x v\> where the weights {w u } are chosen using 
the LS solution ll3~Tll . An alternative popular choice is to replace the ^i-norm with concave regularization 
terms |[T5l , such as J2v=i^°&( x l' + £l)> where Si is a small positive constant introduced to avoid 
numerical instability. In addition to mitigating bias effects, concave regularization terms provide tighter 
approximations to the ^o-(P seu do)norm, and although they render the cost in © nonconvex, they are 
known to converge very fast to an improved estimate of x, when initialized with the Lasso solution |fT51 . 

Remark 2: (Group Lasso and Matrix S-TLS) When groups {x 5 }^ =1 of x entries are a priori known to 
be zero or nonzero (as a group), the ^i-norm in ([3]) must be replaced by the sum of ^2-norms, namely 
Yl g =i I! x gll2- The resulting group S-TLS estimate can be obtained using the group-Lasso solver |fT9l . In 
the present context, this is further useful if one considers the matrix counterpart of the S-TLS problem 
in (O, which in its unconstrained form can be written as [cf. (|5])] 

n 

|Y - (A + E)X|||. + ||E||| + A W x uh (22) 



Xs-tls, E 5 _tls \ = argmin 



u=l J 

where x^ denotes the z/-th row of the n x L unknown matrix X, which is sparse in the sense that a 
number of its rows are zero, and has to be estimated using an m x L data matrix Y along with the 
regression matrix A, both with perturbations present. Problem (l22l ) can be solved using block coordinate 
descent cycling between iterates E(i) and rows x^(z) as opposed to scalar entries as in (|2TI ). 

V. Weighted and Structured S-TLS 

Apart from the optimality links established in Proposition [TJ under (asl), the S-TLS criteria in (O, ©, 
and (O make no assumption on the perturbations [Ee]. In this sense, the S-TLS solvers of the previous 
section find universal applicability. However, one expects that exploiting prior information on [E e], can 
only lead to improved performance. Thinking for instance along the lines of weighted LS, one is motivated 
to weight ||E|||, and ||e||2 in © by the inverse covariance matrix of E and e, respectively, whenever 
those are known and are not both equal to I. As a second motivating example, normal equations, involved 
in e.g., linear prediction, entail structure in E and e that capture sample estimation errors present in the 
matrix [Ay], which is Toeplitz. Prompted by these examples, this section is about broadening the scope 
of S-TLS with weighted and structured forms capitalizing on prior information available about the matrix 
[E e] . To this end, it is prudent to quantify first the notion of structure. 
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Definition 1. The mx (n + 1) data matrix [Ay](p) has structure characterized by an n p xl parameter 
vector p, if and only if there is a mapping such that p £ W n " — > [Ay](p) := S(p) € M mx ( n+1 ). 

Definition Q] is general enough to encompass any (even unstructured) matrix [Ay](p), by simply 
letting p := vec([A y]) € R m ( n+1 ) comprise all entries of [Ay]. However, it becomes more relevant 
when rip <C m(n + 1), the case in which p characterizes [A y] parsimoniously. Application examples 
are abundant: structure in Toeplitz and Hankel matrices encountered with system identification, decon- 
volution, and linear prediction; as well as in circulant and Vandermonde matrices showing up in spatio- 
temporal harmonic retrieval problems |[23l . Structured matrices A and sparse vectors x D emerge also in 
contemporary CS gridding-based applications e.g., for spectral analysis and estimation of time-varying 
channels, where rows of the FFT matrix are selected at random. (This last setting appears when training 
orthogonal frequency-division multiplexing (OFDM) input symbols are used to estimate communication 
links exhibiting variations due to mobility-induced Doppler effects ||6*1.) 

Consider now re-casting the S-TLS criteria in terms of p, and its associated perturbation vector denoted 
by e € M. np . The Frobenius norm in the cost of ( f3ab is mapped to the ^-norm of e; and to allow for 
weighting the structured perturbation vector using a symmetric positive definite matrix W € K nfXn,, l 
the weighted counterpart of ||[E e]||fi becomes e T We. With regards to the constraint, recall first from 
Definition Q] that S(p) = [Ay], which implies S(p + e) = [A + E y + e]; hence, re- writing d3bl) as 
[A + E y + e] [x T , — l] T = 0, yields the structured constraint as S(p + e) [x T , — l] T = 0. Putting things 
together, leads to the combined weighted-structured S-TLS version of ® as 

min e T We + A||x||i (23a) 



s. to S(p + e) 



(23b) 



x 
-1 

which clearly subsumes the structure-only form as a special case corresponding to W = I. 

To confine the structure quantified in Definition [T] two conditions will be imposed, which are commonly 
adopted by TLS approaches ||23l , and are satisfied by most applications mentioned so far. 
(as2) The structure mapping in Definition\l\is separable, meaning that with p = [(p^) r (p y ) T ] T > where 
p A <G M. nA and p y G M n », it holds that S(p) := [Ay](p) = [A(p A ) y(p y )]. In addition, the separable 
structure mapping is linear (more precisely affine), if and only if the S(p) matrix is composed of known 
structural elements, namely "matrix atoms" So, {S^}^ 1 and "vector atoms" {s^}^! =1 , so that 



S(p) = So + 



n A 

y 
■k 

,k=l k=l 



(24) 
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where p A (p y k ) denotes the k-th entry of p A (p y ). 

Similar to Definition [H (124b is general enough to encompass even unstructured matrices S(p) := [Ay], 
by setting So = 0, p := vec([A y]) G R m ( n + 1 ), n p := ua + n y = mn + m, and selecting the m vector 
atoms (mn matrix atoms) as the canonical vectors (matrices), each with one entry equal to 1 and all 
others equal to 0. Again, interesting structures are those with ua <C mn and/or n y <C m. (Consider for 
instance a circulant mxn matrix A, which can be represented as in (1241 using ua = m matrix atoms.) 

Separability and linearity will turn out to simplify the constraint in (I23bb for some given matrix atoms 
and vector atoms collected for notational brevity in the matrices 



[Sf--Sfj and S y : 



(25) 



Indeed, linearity in (as2) allows one to write S(p + e) = S(p) + S(e), and the constraint (I23bb 



as: S(e)[x T ,-l] 



IT 



S(p)[x T ,— l] 1 = y — Ax; while separability implies that S(e)[x' J ,— 1] J 



IT 



[Ek=i4 S k ElLi44][x T ,- l ] T = S A {I®x)e A -S y £ y > ^ere the definitions e := [(e A ) T (e y ) T ] T 
and (1251 were used in the last equality along with the identity Ylk=i e ^fc x = S A (I <S> x)e j4 . In a 
nutshell, (I23bb under (as2) becomes S A (I®x)e A — S y e y = y — Ax, in which e A is decoupled from e y . 
Therefore, the weighted and structured (WS)S-TLS problem in (1231 reduces to [cf. ©] 



mm 





T 




" e A ' 




" e A ' 




w 




e y 




e y 



+ Allxl 



s. to [S A (I®x) - S y ] 



Ax 



(26a) 



(26b) 



or in a more compact form as: min Xj e {e T W e + A||x||i} s.to G(x)e = r(x), after defining 

G(x) := [S A (I ® x) S y ] and r(x) := y - Ax . 



(27) 



Comparing ® with (f26T > allows one to draw apparent analogies: both involve three sets of optimization 
variables, and both are nonconvex because two of these sets enter the corresponding constraints in a 
bilinear fashion [cf. product of E with x in (l3bl . and e A with x in (I26bl )l. 

Building on these analogies, the following lemma shows how to formulate WSS-TLS criteria, parallel- 
ing those of Lemma [T] where one or two sets of variables were eliminated to obtain efficient, provably 

convergent solvers, and establish statistical optimality links within the EIV model in (0]). 

Lemma 2: The constrained WSS-TLS form in (1231 ) is equivalent to two unconstrained nonconvex 
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optimization problems: (a) one involving x and e A variables, namely 

T 



{x, e }\VSS-TLS = arg min 

x,e A 



(^)t[S^(I®x)e A -r(x)] 
+ Allxlli 



W 



(5^)t[5 A (I®x)e A -r(x)] 



(28) 



where S y is assumed full rank and square 



3 i.e., 



m = n, 



in (127 1 ); and also (b) one involving only the 



variable x, expressed using the definitions in (1271 ). as 

*WSS-tls = arg min {r T (x) [G(x) W" 1 G T (x)] f r(x) + A||x||i} . (29) 

Proof: Constraint (I26bb can be solved uniquely for e y to obtain e y = (S y )^ [S A (l<g>x)e A — (y — Ax)]. 
Plug the latter with the definition of r(x) from (1271 ) into the quadratic form in (|26a| ) to recognize that 
(1261 ) is equivalent to the unconstrained form in (128T ) with the e y variable eliminated. 

To arrive at d29l . suppose that x is given and view the compact form of (l26l ) (after ignoring A||x||i) as 
the following weighted minimum-norm LS problem: min x e {e T We+A||x||i} s.to G(x)e = r(x). Solving 
the latter in closed form expresses e in terms of x as: e(x) = W _1 G T (x) [G(x)W -1 G T (x)] r(x). 
Substitute now e(x) back into the cost, and reinstate A||x||i, to obtain (|29l ). ■ 

The formulation in (|28T ) suggests directly an iterative WSS-TLS solver based on the block coordinate 
descent method. Specifically, suppose that the estimate e A (i) of e A is available at iteration i. Substituting 
e A (i) into (T28T ). allows estimating x as 



x(i) = arg min 



e A (i) 

(^)t[5 A (I®x)e A (i)-r(x)] 



W 



e A (i) 

(5'f)t[5 A (I®x)e A (i)-r(x)] 



+A||x||i. (30) 



Since r(x) is linear in x [cf. ([27])], the cost in (f30l > is convex (quadratic regularized by the ^-norm as in 
the Lasso cost in ([]])); thus, it can be minimized efficiently. Likewise, given x(i) the perturbation vector 
for the ensuing iteration can be found in closed form since the pertinent cost is quadratic; that is, 



e A {i + 1) = arg min 



(Sf)t[S A (I®x(i))e A -r(x(i))] 



W 



(5^)t[5 A (I®x(i))e A -r(x(z))] 



(31) 



'Tall S v matrices with full column rank can be handled too for block diagonal weight matrices W typically adopted with 
separable structures. This explains why the pseudo-inverse of S v is used in this section instead of its inverse; but exposition 
of the proof simplifies considerably for the square case. Note also that the full rank assumption is not practically restrictive 
because data matrices perturbed by noise of absolutely continuous pdf have full rank almost surely. 
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w 



(32) 



To express e A (i + 1) compactly, partition W in accordance with p = [(p" 4 ) 7 ", (p y ) T ] T ', i.e., let 

Waa W Ay 

™Ay W,, 

Using (l32l) . and equating to zero the gradient (wrt e A ) of the cost in (f3TT >. yields the closed form 

e A (t + 1) = {S(x(i))WS T (x( ! ))} t S(x(i))[W^, W^] T (^)tr(x(i)) (33) 

where S(x(i)) := [i, [(S^S^I ® x(i))] T . 

Initialized with e^(0) = nAX i, the algorithm cycles between iterations (|3Qb and d33l . Mimicking the 

steps of Proposition [3l it is easy to show that these iterations are convergent as asserted in the following. 
Proposition 4: (Convergence). The iterates in d301 > and d33D converge monotonically at least to a 

stationary point of the cost in (123I ). provided that S v in (1271 ) has full column rank. 

As with the solver of Section ITV-B1 CD is also applicable to the WSS-TLS solver, by cycling between 

e A (i) and scalar iterates of the x(z) entries. To update the z^-fh entry x v (i), suppose precursor entries 
{xi(i), . . . , x v -i(i)} have been already obtained in the i-th iteration, and postcursor entries {x u+ \(i — 
1), . . . , x n (i — 1)} are also available from the previous (i — l)-st iteration along with e A (i), found in 
closed form as in ( f33t . Letting ct v (i) denote the u-th column of (A + $2fc=i e k ^fc)]' tne e ff ect 

of these known entries can be removed from y by forming [cf. (1201 )1 

v — 1 n 

e v (i):=(S v )^y-J2a j (i)x j (i)- £ aj (i) Xj (i - 1) . (34) 

Using d34l , the vector optimization problem in (|3Qb now reduces to the following scalar one with x v (i) as 
unknown: x u (i) = axgmin a;>/ {||a I/ (i)a; v — («) 1 1 vp^^ + 2[e A (i)] T W Ay a u (i)x u + X\x u \}, where || • \\ Wyy 
denotes the i^-norm weighted by W w . The solution of this scalar Lasso problem can be expressed using 
the same soft-thresholding form as in (|2"TT ). and is given by 



X,Al 



sign ([e„(i)W w - e A (i)Wl/ct u (i)) 



[e„(i)W yy - e A ii)W Ay Y cc v {i 



X 



\ct u (i^ 2 



2\\a l/ (i^ 2 



(35) 



This block CD algorithm enjoys fast convergence (at least) to a stationary point, thanks both to the 
simplicity of 051 ), and the sparsity of x(i). 

The WSS-TLS criterion in d28l) is also useful to establish its statistical optimality under a structured 
EIV model, with output-input data obeying the relationships 



I! A 



y = A(p^)x + (-S»ey) , A = A(p^) + - £ e A , k S 



(36) 



k=l 
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where perturbation vectors and e y play the role of and e y in (0]), and differ from the optimization 
variables e A and e y in (l26l ). Unknown are the vector x G , and the inaccessible input matrix A(p^), 
characterized by the vector p A . The model in 061 ) obeys the following structured counterpart of (asla). 
(asl') Perturbations in (136b are jointly Gaussian, i.e., [e^ e y ] ~ AA(0,W~ 1 ), as well as independent 
from p A and x G . Vector x G has i.i.d. entries with the same prior as in (asla); and it is independent from 
p A , which has i.i.d. entries drawn from a zero-mean uniform (i.e., non-informative) prior pdf 

The following optimality claim holds for the WSS-TLS estimator in (l28l) . assured to be equivalent to 
the solution of problem d26l) by Lemma [2 

Proposition 5: (MAP optimality of WSS-TLS). Under (asl') and (asl), the equivalent WSS-TLS problem 

in d28l ) yields the MAP optimal estimator of x G and pa in the structured EIV model ( I36I ). 

Proof: The proof follows the lines used in proving the MAP optimality of ([5]) under (asla) in 

Proposition [TJ The log-prior pdf of x Q contains an £i-norm term as in ([8]), while the uniform prior on 

p A is constant under (asl'). Furthermore, given the structure mapping [A y] = S(p), the conditional 

log-likelihood here can be expressed in terms of x and e A , as lnp[y, A|x Q = x, p A = p A + e A ] = 

Inp [e A = e A , e y = -(S y )^[y - (A + Y%=i 4 Sj£)x] = (S» ) ] [S A (l ® x)e A - r(x)]] . After omitting 

terms not dependent on x and e A , the conditional log-likelihood under the joint Gaussian distribution in 

(asl') boils down to half of the quadratic cost in (|28T ). Combining the latter with ||x||i from the log-prior 

pdf, it follows that maximizing the log-posterior pdf amounts to minimizing the unconstrained sum of 

the two, which establishes MAP optimality of the WSS-TLS estimator in (l28i ■ 

VI. S-TLS Applications 

In this section, the practical impact of accounting for perturbations present in the data matrix [A y] 
will be demonstrated via two sensing applications involving reconstruction of sparse vectors. In both, the 
perturbation comes from inaccurate modeling of the underlying actual matrix A G , while e y is due to 
measurement noise. 

A. Cognitive Radio Sensing 

Consider N s sources located at unknown positions, each transmitting an RF signal with power spectral 
density (PSD) c & s (/) that is well approximated by a basis expansion model: & s (f) = "Y^vLi x svb v (f), 
where {b u (f)}^l 1 are known (e.g., rectangular) basis functions, and are unknown power 

coefficients. As source positions are also unknown, a Cartesian grid of known points {Ig}^^ is adopted 
to describe candidate locations that transmitting radios could be positioned 0], 0; see also Fig. Q] 
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The task is to estimate the locations and powers of active sources based on PSD samples measured at 
N r cognitive radios (CRs) at known locations {£ r }^ 1 . Per frequency f k , these samples obey the model 

N g N g N b 

*r(/k) = Y,^ g (fk)W r +e r (f k ) = E^v^(/ fc )]v+ ff '+er(/fc) = at(f k )x +e r (f k ) (37) 

9=1 9=1 v=l 

where the PSD is nonzero only if a transmitting source is present at l g ; ^y gr represents the channel 

gain from the candidate source at l g to the CR at £ r that is assumed to follow a known pathloss function 
of the distance \\l g — £ r \\; of denotes the known noise variance at receiver r; the N},N g x 1 vector a r (/fc) 
collects products jg r b u (fk)', vector x G contains the NbN g unknown power coefficients x gu ; and e r (fk) 
captures the error between the true, & r {fk), and estimated, <&r(/fc)> PSDs. 

Estimated PSD samples at K frequencies from all N r receivers are first compensated by subtracting the 
corresponding noise variances, and subsequently collected to form the data vector y of length m = KN r . 
Noise terms —e T (f k ) are similarly collected to build the perturbation vector e y . Likewise, row vectors 
a^(fk) of length n = Nt,N g are concatenated to form the m x n matrix A. The latter is perturbed 
(relative to the inaccessible A D ) by a matrix E^, which accounts for the mismatch between grid location 
vectors, {l g }^l v and those of the actual sources, {kJ^p To specify E^, let e gr := 7 sr — j gr for the 
source at k s closest to l g , and substitute 7 ffr = j sr — e s gr into the double sum inside the square brackets 
of (|37T ). This allows writing A = A D — E^, where E^ is affine structured with coefficients {e* r } and 
matrix atoms formed by {b u (fk)}- All in all, the setup fits nicely the structured EIV model in (l36l ). 

Together with {e s gr }, the support of x D estimates the locations of sources, and the nonzero entries 
of x their transmit powers. Remarkably, this grid-based approach reduces localization - traditionally a 
nonlinear estimation task - to a linear one, by increasing the problem dimensionality (N g S> N s ). What is 
more, x D is sparse for two reasons: a) relative to the swath of available bandwidth, the transmitted PSDs 
are narrowband; hence, the number of nonzero x gu s is small relative to iV&; and (b) the number of actual 
sources (N s ) is much smaller than the number of grid points (N g ) that is chosen large enough to localize 
sources with sufficiently high resolution. Existing sparsity-exploiting approaches to CR sensing rely on 
BP/Lasso, and do not take into account the mismatch arising due to griding (H, O. Simulations in Section 
IVIII will demonstrate that sensing accuracy improves considerably if one accounts for grid-induced errors 
through the EIV model, and compensates for them via the novel WSS-TLS estimators. 

B. DoA Estimation via Sparse Linear Regression 

The setup here is the classical one in sensor array processing: plane waves from N s far-field, narrow- 
band sources impinge on a uniformly-spaced linear array (ULA) of N r (possibly uncalibrated) antenna 
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elements. Based on as few N r X 1 vectors of spatial samples collected across the ULA per time instant 
t (snapshot), the task is to localize sources by estimating their directions-of-arrival (DoA) denoted as 
High-resolution, (weighted) subspace-based DoA estimators are nonlinear, and rely on the 
sample covariance matrix of these spatio-temporal samples, which requires a relatively large number 
of snapshots for reliable estimation especially when the array is not calibrated; see e.g., ETI . This has 
prompted recent DoA estimators based on sparse linear regression, which rely on a uniform polar grid 
of N g points describing candidate Do As {9 g } g =i ifTTl . ll22l . Similar to the CR sensing problem, the 
g-th entry x g of the iVj x 1 unknown vector of regression coefficients, x o i , is nonzero and equal to the 
transmit-source signal power, if a source is impinging at angle 9 g , and zero otherwise. 

The iV r x 1 array response vector to a candidate source at DoA 9 g is a(9 g ) = [1 e - - 7 " 9 • • • e~^ a ^ Nr ^ 1 ^} T , 
where a g := 2irdsm(9 g ) denotes the phase shift relative to the source signal wavelength between 
neighboring ULA elements separated by distance d. The per-snapshot received data vector y t of length 
m = N r obeys the EIV model: y t = (A + E^) x o 4 + (— e y> t), where — e y t t represents the additive noise 
across the array elements; the m x n matrix A := [a(#i) • • • &(9n cj )] denotes the grid angle scanning 
matrix of n = N g columns; and represents perturbations arising because Do As from actual sources 
do not necessarily lie on the postulated grid points. Matrix E^ can also account for gain, phase, and 
position errors of antenna elements when the array is uncalibrated. 

To demonstrate how a structured S-TLS approach applies to the DoA estimation problem at hand, con- 
sider for simplicity one source from direction $ s , whose nearest grid angle is 9 g ; and let e s g := "9 s — 9 g be 
the corresponding error that vanishes as the grid density grows large. For small e g , the actual source-array 
phase shift a s := 2irdsm(9 g + £ g ) can be safely approximated as 2ird[sm(9 g ) cos{e s g ) + cos(# 5 ) sin(ep] m 
27rd[s'm(9g) + e s g cos(# 3 )]; or, more compactly as a s « a g + e g f3 g , where f3 g := 2irdcos{9 g ). As a result, 
using the approximation exp(— ja s ) ss exp[— j(a g + e^/? 9 )] ~ (1 — jtgPg) exp(— ja g ), the actual array 
response vector can be approximated as a linear function of e*; thus, it be expressed as 

a(0 a ) = a(0 9 ) + e a g 4>(0 g ) , <f>(9 g ) := [0 - j0 g e^ a ", . . . , -jf3 g (N r - i) e -^ 9 (^-i)]T (38) 

With columns obeying (l38l) . the actual array manifold is modeled as A Q = A+E^, where the perturbation 
matrix is structured as E^ = Yl g =i e gSf, with the A^ r x N g matrix having all zero entries, except 
for the g-th column that equals <t>{0 g ). With such an array manifold and S v = I, the grid-based DoA 
setup matches precisely the structured EIV model in d36l) . The simulated tests in the ensuing section will 
illustrate, among other things, the merits of employing WSS-TLS solvers to estimate x o t and e g based 
on data collected by possibly antenna arrays. But before this, a final remark is in order. 
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Remark 3: (Relationships with Kffii and /l22l/) Although is not explicitly included in the model of 
existing grid-based approaches, this mismatch has been mitigated either by iteratively refining the grid 
around the region where sources are present E2l . or, by invoking the minimum description length (MDL) 
test to estimate the number of actual sources N s , followed by spatial interpolation to estimate their Do As 
|[T6l . These remedies require post-processing the initial estimates obtained by sparse linear regression. 
In contrast, the proposed structured S-TLS based approach jointly estimates the nonzero support of x 0;i 
along with grid-induced perturbations. This allows for direct compensation of the angle errors to obtain 
high-resolution DoA estimates in a single step, and in certain cases without requiring multiple snapshots. 
Of course, multiple snapshots are expected to improve estimation performance using the matrix S-TLS 
solver mentioned in Remark [2] 

VII. Simulated Tests 

Four simulated tests are presented in this section to illustrate the merits of the S-TLS approach, starting 
from the algorithms of Section [TV] 

Test Case 1: (Optimum vs. suboptimum S-TLS) The EIV model in (0]) is simulated here with a 6 x 10 
matrix A, whose entries are i.i.d. Gaussian having variance 1/6, so that the expected ^-norm of each 
column equals 1. The entries of E^ and e y are also i.i.d. Gaussian with variance 0.0025/6 corresponding 
to entry-wise signal-to-noise ratio (SNR) of 26dB. Vector x D has only nonzero elements in the two first 
entries: x a> i = —1.3 and x Q ^ = 5. Algorithm 1 is tested with fi = 5 against Algorithm [2] implemented 
with different values of A to obtain a solution satisfying ||xs_ti,£||i = M- F° r variable tolerance values e 
in Algorithm 1 1-b I the attained minimum cost /(x) in (fTTb is plotted in Fig. [2] To serve as a benchmark, 
a genie-aided globally optimum scheme is also tested with the support of x known and equal to that of 
x D . Specifically, the genie-aided scheme minimizes /(x) over all points with ^i-norm equal to p,, and 
all entries being except for the first two. Using the equivalence between (fTTb and (fT2l . the genie-aided 
scheme per iteration amounts to minimizing a scalar quadratic program under linear constrains, which is 
solved efficiently using the interior-point optimization routine in |[27l . 

Fig. [2] shows that as e becomes smaller, the minimum achieved value /(x) decreases monotonically, 
and drops sharply to the global minimum attained by the genie-aided bisection scheme. Interestingly, 
the alternating descent algorithm that guarantees convergence to a stationary point, exhibits performance 
comparable to the global algorithm. For this reason, only the alternating descent algorithm is used in all 
subsequent tests. Next, S-TLS estimates are compared with those obtained via BP/Lasso and (regularized) 
TLS in the context of the CR sensing and array processing applications outlined in Section [VT] 
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Test Case 2: (S-TLS vs. Lasso vs. TLS) The setup here is also based on the EIV model dU), with A of 
size 20 x 40 having i.i.d. Gaussian entries; and x D having 5 nonzero i.i.d. standardized Gaussian entries. 
By averaging results over 200 Monte Carlo runs, the S-TLS solution is compared against the Lasso one 
for 20 values of A (uniformly spaced in log-scale), based on the £2, h, and £$ errors of the estimated 
vectors relative to x G . (The £q error equals the percentage of entries for which the support of the two 
vectors is different.) Fig. [3] corroborates the improvement of S-TLS over Lasso, especially in the £q norm. 
Fig. He) further demonstrates that over a range of moderate A values, S-TLS consistently outperforms 
Lasso in recovering the true support of x Q . For high A's, both estimates come close to the all-zero vector, 
so that the £q errors become approximately the same, even though the £2 and l\ errors are smaller for 
Lasso. However, for both error norms S-TLS has a slight edge over moderate values of A. 

Receiver operating characteristic (ROC) curves are plotted in Fig. [3]cl) to illustrate the merits of S- 
TLS and Lasso over (regularized) TLS in recovering the correct support. The "best" A for the S-TLS and 
Lasso algorithms is chosen using cross-validation ll25l . As TLS cannot be applied to under-determined 
systems, a 40 x 40 matrix A is selected. Since TLS and LS under an ^2-norm constraint ||x||2 < \i are 
known to be equivalent when \i is small ll26l . the regularized TLS is tested using the function lsqi 
for regularized LS from 11181 . The probability of correct detection, Pd, is calculated as the probability 
of identifying correctly the support over nonzero entries of x D , and the probability of false alarms, Pj a , 
as that of incorrectly deciding zero entries to be nonzero. The ROC curves in Fig. [3{d) demonstrate the 
advantage of Lasso, and more clearly that of S-TLS, in recovering the correct support. 
Test Case 3: (CR Spectrum Sensing) This simulation is performed with reference to the CR network 
in the region [0 1] x [0 1] in Fig. Q] The setup includes N r = 4 CRs deployed to estimate the power 
and location of a single source with position vector [0.4 0.6], located at the center of four neighboring 
grid points. The CRs scan K = 128 frequencies from 15MHz to 30MHz, and adopt the basis expansion 
model in Section |VH-A with iV& = 16 rectangular b v (f) functions, each of bandwidth 1MHz. The actual 
source only transmits over the v = 6-th band. The channel gains are exponentially decaying in distances 
with exponent —1/2. The received data are generated using the transmit PSD described earlier, a regular 
Rayleigh fading channel with 6 taps, and additive white Gaussian receiver noise at SNR=0dB. Receive- 
PSDs are obtained using exponentially weighted periodograms (with weight 0.99) averaged over 1,000 
coherence blocks; see also H for more details of a related simulation. The WSS-TLS approach is used 
to account for perturbations e s gr in the channel gains. A diagonal matrix W is used with each diagonal 
entry equal to a~ 2 (inversely proportional to the average of of sample variances of e s gr ). 

With A chosen as in ifTTI . both Lasso and WSS-TLS identify the active frequency band correctly (only 
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the entries {x g e}g=i were estimated as nonzero). However, Lasso identifies four transmitting sources at 
positions [0.3(0.5) 0.5(0.7)], the four grid points closest to [0.4 0.6]. WSS-TLS returns only one source 
at position [0.5 0.5], along with the estimated e gr that yields ^ sr = e gr + ^ gr . Concatenate the latter to 
form 7 S of length N r = 4 <C m. Using a refined grid of 25 points uniformly spaced over the "zoom-in" 
region [0.3 0.7] x [0.3 0.7] centered at [0.5 0.5], correlation coefficients between 7 S and those of each 
candidate point are evaluated. The source position is estimated as the point with maximum correlation 
coefficient, which for WSS-TLS occurs at the true location [0.4 0.6]. To illustrate graphically the two 
alternatives, the estimated maps of the spatial PSDs at the 6th frequency band are plotted in Fig. @Ia) 
using the Lasso, and in Fig. |Ub) using WSS-TLS. The marked point indicates the actual source location 
[0.4 0.6] in both maps. Unlike Lasso, the WSS-TLS identifies correctly the true position of the source. 
Test Case 4: (Do A Estimation) The setup here entails a ULA consisting of N r = 8 antenna elements 
with inter-element spacing d = 1/2, and a grid of N g = 90 scanning angles from —90° to 90° wrt the 
array boresight. Two sources (N s = 2) of unit amplitude impinge from angles $ s = 1° and —9°, both 1° 
off their nearest grid Do As. As in the single-snapshot test in 1221 . the SNR is set to 20dB. The variance 
of e s g in (l38l ) is obtained from the uniform distribution in [—1°, 1°]. Selecting A according to the noise 
level as in E2l . Lasso returns four nonzero entries, two around each source at $ s ± 1°; while WSS-TLS 
gives two nonzero 9 g estimates at —10° (g = 40) and 0° (g = 45), along with perturbation estimates e| 
and e| 5 . Using the latter, the DoAs are estimated as $ s := 9 g + e g for g = 40, 45. The angle spectra 
using Lasso, and WSS-TLS with estimated are compared in Fig. [3a). The two black arrows depict 
the actual source angles, and benchmark the true angular spectrum. 

To further illustrate the merits of WSS-TLS in estimating correctly the closest grid point and subse- 
quently each DoA, the sample variance of a DoA estimate is plotted versus SNR in Fig. Ob) using Monte 
Carlo runs, each with a single source randomly placed over [—1°, 1°]. Both WSS-TLS and Lasso are post- 
processed by interpolating peaks in the obtained spectra from two nearest grid points, linearly weighted 
by the estimated amplitudes as in fPTll . Both curves confirm that WSS-TLS outperforms the Lasso. 
More interestingly, the two WSS-TLS curves almost coincide, which further corroborates that WSS-TLS 
manages in a single step to identify correctly the support of x o t without requiring post processing. 

VIII. Concluding Remarks 

An innovative approach was developed in this paper to account for sparsity in estimating coefficient 
vectors of fully-perturbed linear regression models. This approach enriches TLS criteria that have been 
traditionally used to fit such models with the ability to handle under-determined linear systems. The 
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novel S-TLS framework also enables sparsity-exploiting approaches (CS, BP, and Lasso) to cope with 
perturbations present not only in the data but also in the regression matrix. 

Near-optimum and reduced-complexity suboptimum solvers with global and local convergence guar- 
antees were also developed to optimize the generally nonconvex S-TLS criteria. They rely on bisection, 
branch-and-bound, or coordinate descent iterations, and have universal applicability regardless of whether 
perturbations are modeled as deterministic or random. Valuable generalizations were also provided when 
prior information is available on the deterministic structure or statistics of the associated (augmented) data 
matrix. Under specific statistical models with errors-in-variables, the resultant (generally weighted and 
structured) S-TLS estimators were proved to be optimal in the MAP sense. Simulated tests corroborated 
the analytical claims, compared competing alternatives, and demonstrated the practical impact of the 
novel S-TLS framework to grid-based sparsity-exploiting approaches for cognitive radio sensing, and 
direction-of-arrival estimation with possibly uncalibrated antenna arrays. 

Interesting topics to explore in future research, include performance analysis for the proposed S-TLS 
algorithms, and online implementations for S-TLS optimal adaptive processing. 
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Algorithm 1-a (BB): Input y, A, a, and 8. Output a ^-optimal solution x* of (fl2l) 



Set X£ = — fil, xjj = fil, Q := {(x^,X[/, — oo}, and initialize with U = oo. 
repeat 

Let (xl,x;7,c) be one triplet of O with the smallest c; and set O = 0\(xl, xy, c). 
Solve ( TOI l locally to obtain x*. 
if .g(x*, a) <U then 

Set W = ,g(x*, a) and x* = x*. {update the minimum} 
end if 



Minimize globally the convex <7l(x, a) in ( fl4l i with the optimum D in (I151 . to obtain x* and £ := <?l(x* 



o. 



if W — £ > <5 {need to split} then 
Find i = argmax„([x;y]„ - [x L ] n ). 

Set Xi i (x^i) and x.l,2 (xc/,2) equal to x^ (xy) except for the i-th entry, {split the maximum separation} 

Set [Xi,i]i = [x L ].j, [x[/,i]j = ([x[/]i - [x L ]i)/2, [x Li2 ]i = ([*u]i - [xl]»)/2, and [x^ 2 ]i = [xt/]i- 
Augment the set of unsolved boxes = Q (J {(x^ 1, xj/,1, £), (xl^, x^2, £)}■ 
end if 
until = 



Algorithm 1-b (Bisection): Input y, A, and tolerances e and 5. Output an e-optimal solution x* to (fTOl ) 

1: Set Iq = 0, uq = \\yW2, iteration index i = 0, and initialize the achievable cost f m = uq with x* = nx i. 

2: repeat 

3: Let a = (/; + iii)/2 and call Algorithm II -al to find a feasible <5-optimal solution x* to (fl21l , 
4: Calculate / g = / (x*), and update the iteration i = i + 1. 
5: if / s < / m then 

Set / m = f g and xj = x*. {update the minimum} 
end if 

if g(x* g ,a) < then 

9: Update Ui ~ a and Z,j = 1 . 

10: else if .g(x*, a) > 6 then 

11: Update l. L = a and Ui = Ui-i. 
12: else 



Update l{ = a — 5 and u.; = Ut-i. 
end if 

Set = min(wi, 
until — k < e 



April 20, 2011 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



28 



Algorithm 2 (CD): Input y, A, and coefficient A. Output the iterates E(i) and x(i) upon convergence. 
Initialize with E(0) = TOX „ and x(— 1) = nxl 
for i = 0, 1, . . . do 

for v = 1, . . . , n do 

Compute the residual e v (i) as in < f20b . 
Update the scalar x v (i) via ( |21~1 >. 
end for 

Update the iterate E(i + 1) as in dl9l . 
end for 
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Fig. 1. Grid topology with N g = 25 candidate locations, N s = 1 transmitting source, and N r = 4 receiving CRs. 
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Fig. 2. Attained /(x) for variable tolerance values e by the global Algorithm II -bl compared to the alternating descent local 
algorithm, and the genie-aided global solver. 
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Fig. 4. Comparison between PSD maps estimated by (a) Lasso, and (b) WSS-TLS for the CR network in Fig. Q] 
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Fig. 5. (a) Angular spectra estimated using Lasso and WSS-TLS as compared to the actual transmission pattern; (b) Comparison 
of angle estimation variances of Lasso, WSS-TLS, without and with interpolation. 
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